maximum margin classifier
Risk Bounds for Over-parameterized Maximum Margin Classification on Sub-Gaussian Mixtures
Modern machine learning systems such as deep neural networks are often highly over-parameterized so that they can fit the noisy training data exactly, yet they can still achieve small test errors in practice. In this paper, we study this "benign overfitting" phenomenon of the maximum margin classifier for linear classification problems. Specifically, we consider data generated from sub-Gaussian mixtures, and provide a tight risk bound for the maximum margin linear classifier in the over-parameterized setting. Our results precisely characterize the condition under which benign overfitting can occur in linear classification problems, and improve on previous work. They also have direct implications for over-parameterized logistic regression.
Benign Overfitting for $\alpha$ Sub-exponential Input
This paper investigates the phenomenon of benign overfitting in binary classification problems with heavy-tailed input distributions. We extend the analysis of maximum margin classifiers to $\alpha$ sub-exponential distributions, where $\alpha \in (0,2]$, generalizing previous work that focused on sub-gaussian inputs. Our main result provides generalization error bounds for linear classifiers trained using gradient descent on unregularized logistic loss in this heavy-tailed setting. We prove that under certain conditions on the dimensionality $p$ and feature vector magnitude $\|\mu\|$, the misclassification error of the maximum margin classifier asymptotically approaches the noise level. This work contributes to the understanding of benign overfitting in more robust distribution settings and demonstrates that the phenomenon persists even with heavier-tailed inputs than previously studied.
Mistake, Manipulation and Margin Guarantees in Online Strategic Classification
Shen, Lingqing, Ho-Nguyen, Nam, Giang-Tran, Khanh-Hung, Kฤฑlฤฑnรง-Karzan, Fatma
Binary classification is a well-known problem in supervised learning, with applications in numerous important domains such as marketing, finance, natural language processing and medicine. The traditional binary classification problem aims to learn a decision rule that maps feature vectors to binary labels 1, with the aim of predicting a true underlying label for a feature vector. For example, features may correspond to identifying information of customers of a bank who apply for a loan, and in this context the label may indicate whether the bank will approve or deny the loan. The true underlying label, whether the bank should approve or deny given all future outcomes, is unknown at the time of the loan application, which necessitates the need to use a classification rule. Like the example above, binary classification is now regularly applied to various applications involving human agents. Customers obviously prefer that their loan application be approved rather than denied, and in many other applications there may similarly be one label that is preferred by agents over the other. One can imagine that in practice this leads to strategic behavior of agents, where feature vectors are manipulated in order to achieve the desired label prediction. Of course, there is often also a cost associated with manipulation, so agents may manipulate only if the payoff for achieving the desirable label is worth the cost.
Is Importance Weighting Incompatible with Interpolating Classifiers?
Wang, Ke Alexander, Chatterji, Niladri S., Haque, Saminul, Hashimoto, Tatsunori
Importance weighting is a classic technique to handle distribution shifts. However, prior work has presented strong empirical and theoretical evidence demonstrating that importance weights can have little to no effect on overparameterized neural networks. Is importance weighting truly incompatible with the training of overparameterized neural networks? Our paper answers this in the negative. We show that importance weighting fails not because of the overparameterization, but instead, as a result of using exponentially-tailed losses like the logistic or cross-entropy loss. As a remedy, we show that polynomially-tailed losses restore the effects of importance reweighting in correcting distribution shift in overparameterized models. We characterize the behavior of gradient descent on importance weighted polynomially-tailed losses with overparameterized linear models, and theoretically demonstrate the advantage of using polynomially-tailed losses in a label shift setting. Surprisingly, our theory shows that using weights that are obtained by exponentiating the classical unbiased importance weights can improve performance. Finally, we demonstrate the practical value of our analysis with neural network experiments on a subpopulation shift and a label shift dataset. When reweighted, our loss function can outperform reweighted cross-entropy by as much as 9% in test accuracy. Our loss function also gives test accuracies comparable to, or even exceeding, well-tuned state-of-the-art methods for correcting distribution shifts.
Risk Bounds for Over-parameterized Maximum Margin Classification on Sub-Gaussian Mixtures
Cao, Yuan, Gu, Quanquan, Belkin, Mikhail
In modern machine learning, complex models such as deep neural networks have received increasing popularity. These complicated models are known to be able to fit noisy training data sets, while at the same time achieving small test errors. In fact, this benign overfitting phenomenon is not a unique feature of deep learning. Even for kernel methods and linear models, Belkin et al. (2018) demonstrated that interpolators on the noisy training data can still perform near optimally on the test data. A series of recent works (Belkin et al., 2019b; Muthukumar et al., 2020b; Hastie et al., 2019; Bartlett et al., 2020) theoretically studied how over-parameterization can achieve small population risk. In particular in Bartlett et al. (2020) the authors considered the setting where the data are generated from a ground-truth linear model with noise, and established a tight population risk bound for the minimum norm linear interpolator with a matching lower bound. More recently, Tsigler and Bartlett (2020) further studied benign overfitting in ridge regression, and established non-asymptotic generalization bounds for over-parametrized ridge regression. They showed that those bounds are tight for a range of regularization parameter values. Notably, these results cover arbitrary covariance structure of the data, and give a nice characterization of how the spectrum of the data covariance matrix affects the population risk in the over-parameterized regime.
Finite-sample analysis of interpolating linear classifiers in the overparameterized regime
Chatterji, Niladri S., Long, Philip M.
A surprising statistical phenomenon has emerged in modern machine learning: highly complex models can interpolate training data while still generalizing well to test data, even in the presence of label noise. This is rather striking as it the goes against the grain of the classical statistical wisdom which dictates that predictors that generalize well should trade off between the fit to the training data and the some measure of the complexity or smoothness of the predictor. Many estimators like neural networks, kernel estimators, nearest neighbour estimators, and even linear models have been shown to demonstrate this phenomenon (see, Zhang et al. 2017; Belkin et al. 2019, among others). This phenomenon has recently inspired intense theoretical research. One line of work (Soudry et al. 2018; Ji and Telgarsky 2019; Gunasekar et al. 2017; Nacson, Srebro, and Soudry 2019; Gunasekar et al. 2018a; Gunasekar et al. 2018b) formalized the argument (Neyshabur, Tomioka, and Srebro 2014; Neyshabur 2017) that, even when there is no explicit regularization that is used in training these rich models, there is nevertheless implicit regularization encoded in the choice of the optimization method used. For example, in the setting of linear classification, (Soudry et al. 2018; Ji and Telgarsky 2019; Nacson, Srebro, and Soudry 2019) show that learning a linear classifier using gradient descent on the unregularized logistic or exponential loss asymptotically leads the solution to converge to the maximum l
Convergence of SGD in Learning ReLU Models with Separable Data
Xu, Tengyu, Zhou, Yi, Ji, Kaiyi, Liang, Yingbin
We consider the binary classification problem in which the objective function is the exponential loss with a ReLU model, and study the convergence property of the stochastic gradient descent (SGD) algorithm on linearly separable data. We show that the gradient descent (GD) algorithm do not always learn desirable model parameters due to the nonlinear ReLU model. Then, we identify a certain condition of data samples, under which we show that SGD can learn a proper classifier with implicit bias. In specific, we establish the sub-linear convergence rate of the function value generated by SGD to global minimum. We further show that SGD actually converges in expectation to the maximum margin classifier with respect to the samples with +1 label under the ReLU model at the rate O(1/ln t). We also extend our study to the case of multi-ReLU neurons, and show that SGD converges to a certain non-linear maximum margin classifier for a class of non-linearly separable data.
Maximum margin classifier working in a set of strings
Koyano, Hitoshi, Hayashida, Morihiro, Akutsu, Tatsuya
Numbers and numerical vectors account for a large portion of data. However, recently the amount of string data generated has increased dramatically. Consequently, classifying string data is a common problem in many fields. The most widely used approach to this problem is to convert strings into numerical vectors using string kernels and subsequently apply a support vector machine that works in a numerical vector space. However, this non-one-to-one conversion involves a loss of information and makes it impossible to evaluate, using probability theory, the generalization error of a learning machine, considering that the given data to train and test the machine are strings generated according to probability laws. In this study, we approach this classification problem by constructing a classifier that works in a set of strings. To evaluate the generalization error of such a classifier theoretically, probability theory for strings is required. Therefore, we first extend a limit theorem on the asymptotic behavior of a consensus sequence of strings, which is the counterpart of the mean of numerical vectors, as demonstrated in the probability theory on a metric space of strings developed by one of the authors and his colleague in a previous study [18]. Using the obtained result, we then demonstrate that our learning machine classifies strings in an asymptotically optimal manner. Furthermore, we demonstrate the usefulness of our machine in practical data analysis by applying it to predicting protein--protein interactions using amino acid sequences.